Lesson 4


Scatterplots and Perceived Audience Size

library(ggplot2)
df <- read.csv('pseudo_facebook.tsv', sep='\t')
qplot(x=age, y=friend_count, data=df)


What are some things that you notice right away?

Response: There seems to be a generally larger spread among younger users.

ggplot Syntax

ggplot(aes(age,friend_count), data=df) + geom_point() + xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).

summary(df$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

Overplotting

ggplot(aes(age,friend_count), data=df) + geom_jitter(alpha=.05) + xlim(13,90)
## Warning: Removed 5168 rows containing missing values (geom_point).


coord_trans()

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

ggplot(aes(age,friend_count), data=df) + geom_point(alpha=.05, position=position_jitter(h=0)) + xlim(13,90) + coord_trans(ytrans='sqrt')
## Warning: Removed 5176 rows containing missing values (geom_point).

What do you notice?

Response: The shape of the data changed. The older you are, the less likely you are to have few friends.
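A short sketch of why the jitter above is restricted to the horizontal direction (h=0). This is an assumption about the intent, not something the lesson states: vertical jitter can push a friend_count of 0 below zero, and the square root of a negative number is NaN, so such points could not be placed on a square-root axis. The values and offsets below are made up for illustration.

```r
# Hypothetical friend counts, including users with zero friends
friend_count <- c(0, 0, 1, 4, 9)

# Hand-picked vertical offsets standing in for random jitter (illustrative only)
offset <- c(-0.4, 0.3, -0.2, 0.1, 0.0)
jittered <- friend_count + offset

# sqrt() of a negative value is NaN, so that point has no position on a sqrt axis
sqrt_jittered <- suppressWarnings(sqrt(jittered))
print(data.frame(friend_count, jittered, sqrt_jittered))

# Without vertical jitter every value stays non-negative and transforms cleanly
sqrt_plain <- sqrt(friend_count)
print(sqrt_plain)
```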

Alpha and Jitter

ggplot(aes(age,friendships_initiated), data=df) + geom_point(alpha=.05, position=position_jitter(h=0)) + xlim(13,90) + coord_trans(ytrans='sqrt')
## Warning: Removed 5184 rows containing missing values (geom_point).


Overplotting and Domain Knowledge


Conditional Means

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
age_groups <- group_by(df, age)
df.fc_by_age <- df %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n=n()) %>%
  arrange(age)
head(df.fc_by_age)
## Source: local data frame [6 x 4]
## 
##   age friend_count_mean friend_count_median    n
## 1  13          164.7500                74.0  484
## 2  14          251.3901               132.0 1925
## 3  15          347.6921               161.0 2618
## 4  16          351.9371               171.5 3086
## 5  17          350.3006               156.0 3283
## 6  18          331.1663               162.0 5196
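To make the grouping step concrete, here is a minimal base R sketch of the same idea on a tiny made-up data frame. The name toy and its values are hypothetical; the point is only to show that each row of the summary is the mean, median, and count of friend_count within one age.

```r
# Made-up stand-in data (hypothetical values, not the pseudo_facebook file)
toy <- data.frame(
  age          = c(13, 13, 13, 14, 14),
  friend_count = c(10, 20, 60, 5, 15)
)

# Conditional summaries of friend_count for each distinct age
fc_mean   <- tapply(toy$friend_count, toy$age, mean)
fc_median <- tapply(toy$friend_count, toy$age, median)
fc_n      <- tapply(toy$friend_count, toy$age, length)

# Assemble one row per age, mirroring the columns used above
toy_by_age <- data.frame(
  age                 = as.numeric(names(fc_mean)),
  friend_count_mean   = as.vector(fc_mean),
  friend_count_median = as.vector(fc_median),
  n                   = as.vector(fc_n)
)
print(toy_by_age)
```

For age 13 the mean (30) is pulled above the median (20) by the single large value, which is the same kind of gap visible between the mean and median columns in the real output.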

Create your plot!

ggplot(aes(age,friend_count_mean), data=df.fc_by_age) + geom_line()


Overlaying Summaries with Raw Data

ggplot(aes(age, friend_count), data=df) +
  geom_point(alpha=0.05, position=position_jitter(h=0), color='orange') +
  geom_line(stat='summary', fun.y=mean) +
  geom_line(stat='summary', fun.y=quantile, probs=.1, linetype=2, color='blue') +
  geom_line(stat='summary', fun.y=quantile, probs=.9, linetype=2, color='blue') +
  geom_line(stat='summary', fun.y=quantile, probs=.5, color='blue') +
  coord_cartesian(xlim=c(13, 70), ylim=c(0, 1000))

What are some of your observations of the plot?

Response: The median friend count doesn’t change much with age.
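The blue lines in the plot above are per-age quantiles. A minimal base R sketch of what those summaries compute, using made-up data (the vectors below are hypothetical and chosen so the quantiles are easy to check by hand):

```r
# Two hypothetical age groups with 11 users each
age          <- rep(c(20, 30), each = 11)
friend_count <- c(1:11, seq(10, 110, by = 10))

# 10th, 50th, and 90th percentiles of friend_count within each age
p10 <- tapply(friend_count, age, quantile, probs = 0.1)
p50 <- tapply(friend_count, age, quantile, probs = 0.5)
p90 <- tapply(friend_count, age, quantile, probs = 0.9)

print(data.frame(age = as.numeric(names(p10)),
                 p10 = as.vector(p10),
                 p50 = as.vector(p50),
                 p90 = as.vector(p90)))
```

Within each age, 80% of the users fall between the p10 and p90 values, which is the band the dashed lines trace out.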

Correlation

cor.test(df$age, df$friend_count, method='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  df$age and df$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
with(df, cor.test(age,friend_count, method='pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737

What’s the correlation between age and friend count? Round to three decimal places.

Response: -0.027

Correlation on Subsets

with(subset(df, age<=70), cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245
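A small sketch of how restricting the range can change the coefficient. The numbers are made up: a perfectly linear decline up to age 70, plus two hypothetical high-age records with large friend counts. It only illustrates the mechanism and says nothing about the real values in df.

```r
# Hypothetical data: linear decline up to 70, then two unusual high-age records
age          <- c(20, 30, 40, 50, 60, 70, 100, 105)
friend_count <- c(600, 500, 400, 300, 200, 100, 550, 600)
toy <- data.frame(age, friend_count)

# Correlation over all records
cor_all <- with(toy, cor(age, friend_count))

# Correlation restricted to age <= 70, as in the call above
cor_sub <- with(subset(toy, age <= 70), cor(age, friend_count))

print(c(all = cor_all, age_le_70 = cor_sub))
```

In this toy case a few records at the far end of the range are enough to mask a strong linear relationship in the rest of the data.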

Create Scatterplots

ggplot(aes(www_likes_received, likes_received), data=df) + geom_point()


Strong Correlations

ggplot(aes(www_likes_received, likes_received), data=df) +
  geom_point() +
  xlim(0, quantile(df$www_likes_received, .95)) +
  ylim(0, quantile(df$likes_received, .95)) +
  geom_smooth(method='lm', color='red')
## Warning: Removed 11608 rows containing missing values (stat_smooth).
## Warning: Removed 11608 rows containing missing values (geom_point).
## Warning: Removed 33 rows containing missing values (geom_path).

What’s the correlation between the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places.

cor.test(df$www_likes, df$www_likes_received, method ='pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  df$www_likes and df$www_likes_received
## t = 97.523, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2903590 0.3017253
## sample estimates:
##       cor 
## 0.2960526
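The question refers to the two plotted variables, www_likes_received and likes_received, whereas the call above pairs www_likes with www_likes_received. A minimal sketch of the call for the plotted pair, run on made-up stand-in data (the data frame toy and its values are hypothetical, so the coefficient printed here is not the answer for df):

```r
# Hypothetical stand-in columns; in the lesson these would come from df
toy <- data.frame(
  www_likes_received = c(0, 1, 2, 5, 10, 40),
  likes_received     = c(1, 1, 5, 9, 22, 75)
)

# Pearson correlation between the two plotted variables
result <- with(toy, cor.test(www_likes_received, likes_received, method = 'pearson'))

# Round the estimate to three decimal places, as the question asks
r <- unname(result$estimate)
print(round(r, 3))
```

With the real data the same call would be with(df, cor.test(www_likes_received, likes_received)).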

More Caution with Correlation

#install.packages('alr3')
library(alr3)
## Loading required package: car

Create your plot!

data("Mitchell")
ggplot(aes(Month, Temp), data=Mitchell) + geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot. Response: 0

  2. What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(Mitchell$Month, Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063

Making Sense of Data

ggplot(aes(Month, Temp), data=Mitchell) + geom_point() + scale_x_continuous(breaks = seq(0,203,12)) + coord_fixed(ratio=.5)


A New Perspective

What do you notice?

Response: There’s a cyclical pattern!
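A sketch of why a cyclical series can have a near-zero Pearson correlation with time. The series below is simulated (a pure 12-month sine wave with a hypothetical amplitude), not the Mitchell data. Folding the month index with the modulo operator is one way to expose the cycle.

```r
# Simulated monthly series over 204 months (17 full yearly cycles)
month <- 0:203
temp  <- 10 * sin(2 * pi * month / 12)

# The linear correlation with the month index is close to zero
r <- cor(month, temp)
print(round(r, 3))

# Folding onto the calendar month (0-11) reveals the structure:
# each calendar month has its own, clearly different, mean
monthly_mean <- tapply(temp, month %% 12, mean)
print(round(monthly_mean, 2))
```

The correlation coefficient only measures linear association, so a strong seasonal relationship can still produce a value near 0.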


Understanding Noise: Age to Age Months

df$age_with_months <- df$age + (12 - df$dob_month)/12
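A worked example of the formula above on three hypothetical users who are all 36 years old but were born in different months:

```r
# Hypothetical users: same age in years, different birth months
age       <- c(36, 36, 36)
dob_month <- c(1, 6, 12)

# Same formula as above: add the fraction of a year left after the birth month
age_with_months <- age + (12 - dob_month) / 12
print(age_with_months)
```

A January birthday adds 11/12 of a year, a June birthday adds half a year, and a December birthday adds nothing, so users with the same age in years are spread over twelve finer bins.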

Age with Months Means

df.fc_by_age_months <- df %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
            friend_count_median = median(friend_count),
            n=n()) %>%
  arrange(age_with_months)

Noise in Conditional Means

ggplot(aes(age_with_months,friend_count_mean), data = subset(df.fc_by_age_months, age_with_months<71)) + geom_point()


Smoothing Conditional Means

ggplot(aes(age_with_months,friend_count_mean), data = subset(df.fc_by_age_months, age_with_months<71)) + geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
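The message above says geom_smooth fell back to loess. A minimal base R sketch of that kind of smoothing, run on simulated monthly means (the trend, noise level, and seed are assumptions chosen for illustration, not values from df):

```r
set.seed(42)

# Simulated conditional means: one value per month of age from 13 to 70
age_with_months   <- seq(13, 70, by = 1/12)
trend             <- 300 - 2 * (age_with_months - 13)       # hypothetical underlying trend
friend_count_mean <- trend + rnorm(length(age_with_months), sd = 40)

# Local regression (loess) of the noisy means on age
fit      <- loess(friend_count_mean ~ age_with_months)
smoothed <- predict(fit)

# Point-to-point variation before and after smoothing
print(c(raw = sd(diff(friend_count_mean)), smoothed = sd(diff(smoothed))))
```

The smoothed curve varies far less from one month to the next than the raw means, at the cost of possibly flattening genuine local features.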